ABSTRACT

Past research on probabilistic databases has studied the 
problem of answering queries on a static database. Ap- 
plication scenarios of probabilistic databases however often 
involve the conditioning of a database using additional in- 
formation in the form of new evidence. The conditioning 
problem is thus to transform a probabilistic database of pri- 
ors into a posterior probabilistic database which is material- 
ized for subsequent query processing or further refinement. 
It turns out that the conditioning problem is closely related 
to the problem of computing exact tuple confidence values. 

It is known that exact confidence computation is an NP- 
hard problem. This has led researchers to consider approx- 
imation techniques for confidence computation. However, 
neither conditioning nor exact confidence computation can 
be solved using such techniques. In this paper we present ef- 
ficient techniques for both problems. We study several prob- 
lem decomposition methods and heuristics that are based on 
the most successful search techniques from constraint satis- 
faction, such as the Davis-Putnam algorithm. We comple- 
ment this with a thorough experimental evaluation of the 
algorithms proposed. Our experiments show that our ex- 
act algorithms scale well to realistic database sizes and can 
in some scenarios compete with the most efficient previous 
approximation algorithms. 

1. INTRODUCTION 

Queries on probabilistic databases have numerous applications
at the interface of databases and information retrieval
[13], data cleaning [4], sensor data, tracking moving objects,
crime fighting [5], and computational science [5].

A core operation of queries on probabilistic databases is 
the computation of confidence values of tuples in the result 
of a query. In short, the confidence in a tuple t being in the 
result of a query on a probabilistic database is the combined 
probability weight of all possible worlds in which t is in the 
result of the query. 

By extending the power of query languages for probabilistic
databases, new applications beyond the mere retrieval of
tuples and their confidence become possible. An essential 
operation that allows for new applications is conditioning, 
the operation of removing possible worlds which do not sat- 
isfy a given condition from a probabilistic database. Subse- 
quent query operations will apply to the reduced database, 
and a confidence computation will return conditional prob- 
abilities in the Bayesian sense with respect to the original 
database. Computing conditioned probabilistic databases 
has natural and important applications in virtually all areas 
in which probabilistic databases are useful. For example, in 
data cleaning, it is only natural to start with an uncertain 
database and then clean it - reduce uncertainty - by adding 
constraints or additional information. More generally, con- 
ditioning allows us to start with a database of prior proba- 
bilities, to add in some evidence, and take it to a posterior 
probabilistic database that takes the evidence into account. 

Consider the example of a probabilistic database of social 
security numbers (SSN) and names of individuals extracted 
from paper forms using OCR software. If a symbol or word 
cannot be clearly identified, this software will offer a number 
of weighted alternatives. The database 
represents four possible worlds (shown in Figure 1), modelling
that John has either SSN 1 or 7, with probability .2
and .8 (the paper form may contain a hand-written symbol
that can either be read as a European "1" or an American
"7"), respectively, and Bill has either SSN 4 or 7, with
probability .3 and .7, respectively. We assume independence
between John's and Bill's alternatives, thus the world in
which John has SSN 1 and Bill has SSN 7 has probability
$.2 \cdot .7 = .14$.

If $A_x$ denotes the event that Bill has SSN $x$, then $P(A_4) = .3$
and $P(A_7) = .7$. We can compute these probabilities in
a probabilistic database by asking for the confidence values 
of the tuples in the result of the query 

select SSN, conf(SSN) from R where NAME = 'Bill'; 

which will result in the table 
Now suppose we want to use the additional knowledge
that social security numbers are unique. We can express this
using a functional dependency SSN → NAME. Asserting
this constraint, or conditioning the probabilistic database
using the constraint, means to eliminate all those worlds in
which the functional dependency does not hold.

Figure 1: The four worlds of the input database.

Let $B$ be the event that the functional dependency holds.
Conceptually, the database conditioned with $B$ is obtained
by removing world $R^4$ (in which John and Bill have the same
SSN) and renormalizing the probabilities of the remaining
worlds to have them again sum up to 1, in this case by dividing
by $.06 + .24 + .14 = .44$. We will think of conditioning as
an operation assert[B] that reduces uncertainty by declaring
worlds in which $B$ does not hold impossible.

Computing tuple confidences for the above query on the
original database will give us, for each possible SSN value
$x$ for Bill, the probabilities $P(A_x)$, while on the database
conditioned with $B$ it will give a table of social security
numbers $x$ and conditional probabilities $P(A_x \mid B)$. For
example, the conditional probability of Bill having SSN 4
given that social security numbers are unique is

$$P(A_4 \mid B) = \frac{P(A_4 \wedge B)}{P(B)} = \frac{.3}{.44},$$

where $P(A_4 \wedge B) = .06 + .24 = .3$ is the combined weight of the
two worlds in which Bill has SSN 4 and the dependency holds.



Using this definition, we could alternatively have com- 
puted the conditional probabilities by combining the results 
of two confidence computations, 

select SSN, P1/P2
from (select SSN, conf(SSN) P1 from R, B
      where NAME = 'Bill'),
     (select conf() P2 from B);

where B is a Boolean query that is true if the functional 
dependency holds on R. 
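
The same ratio can be checked by brute force on the running example. The following is a minimal sketch, assuming the two independent alternatives described above; the helper names (`worlds`, `fd_holds`, `conditional_conf`) are hypothetical and enumerate all worlds explicitly, which is only feasible for toy inputs.

```python
from itertools import product

# Alternatives per person, read off the running example: SSN -> probability.
alternatives = {
    "John": {1: 0.2, 7: 0.8},
    "Bill": {4: 0.3, 7: 0.7},
}

def worlds(alts):
    """Yield (world, probability) pairs, assuming independent choices per name."""
    names = sorted(alts)
    for combo in product(*(alts[n].items() for n in names)):
        world = {n: ssn for n, (ssn, _) in zip(names, combo)}
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob

def fd_holds(world):
    """Event B: the functional dependency SSN -> NAME, i.e. no SSN is shared."""
    return len(set(world.values())) == len(world)

def conditional_conf(alts, name, ssn, condition):
    """P(A | B) = P(A and B) / P(B), by enumeration of all worlds."""
    p_b = sum(p for w, p in worlds(alts) if condition(w))
    p_ab = sum(p for w, p in worlds(alts) if condition(w) and w[name] == ssn)
    return p_ab / p_b

p_b = sum(p for w, p in worlds(alternatives) if fd_holds(w))
p_bill_4 = conditional_conf(alternatives, "Bill", 4, fd_holds)
```

Here `p_b` plays the role of P2 and the numerator inside `conditional_conf` that of P1 in the query above.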

Unfortunately, both conditioning and confidence compu- 
tation are NP-hard problems. Nevertheless, their study is 
justified by their obvious relevance and applications. While 
conditioning has not been previously studied in the con- 
text of probabilistic databases, previous work on confidence 
computation has aimed at cases that admit polynomial-time 
query evaluation and at approximating confidence values [5]. 

Previous work often assumes that confidence values are 
computed at the end of a query, closing the possible worlds 
semantics of the probabilistic database and returning a com- 
plete, nonprobabilistic relation of tuples with numerical con- 
fidence values that can be used for decision making. In such 
a context, techniques that return a reasonable approxima- 
tion of confidence values may be acceptable. 

In other scenarios we do not want to accept approxi- 
mate confidence values because errors made while comput- 
ing these estimates aggregate and grow, causing users to 
make wrong decisions based on the query results. This is 
particularly true in compositional query languages for probabilistic
databases, where confidence values computed in a
subquery form part of an intermediate result that can be
accessed and used for filtering the data in subsequent query
operations [19].

Similar issues arise when confidence values can be inserted 
into the probabilistic database through updates and may be 
used in subsequent queries. For example, data cleaning is 
a scenario where we, on one hand, want to materialize the 
result of a data transformation in the database once and 
for all (rather than having to redo the cleaning steps every 
time a query is asked) and on the other hand do not want 
to store incorrect probabilities that may affect a very large 
number of subsequent queries. Here we need techniques for 
conditioning and exactly computing confidence values. 

Exact confidence computation is particularly important in
queries in which confidence values are used in comparison
predicates. For example, let us add a third person, Fred,
to the database whose SSN is either 1 or 4, with equal probability.
If we again condition using the functional dependency
SSN → NAME, we have only two possible worlds, one in
which John, Bill, and Fred have social security numbers 1,
7, and 4, respectively, and one in which their SSNs are 7, 4,
and 1. If we now ask for the social security numbers that
are in the database for certain,

select SSN from R where conf(SSN) = 1; 

we should get three tuples in the result. Monte Carlo simulation
based approximation algorithms will do very badly on
such queries. Confidence approximation using a Karp-Luby-style
algorithm [17, 9, 21] will independently underestimate
each tuple's confidence with probability ≈ .5. Thus the probability
that at least one tuple is missing from the result of
such a query is very high (see also [19]).

In this paper, we develop efficient algorithms for com- 
puting exact confidences and for conditioning probabilistic 
databases. The detailed contributions are as follows. 

• In most previous models of probabilistic databases over
finite world-sets, computing tuple confidence values essentially
means the weighted counting of solutions to
constraint structures closely related to disjunctive normal
form formulas. Our notion of such structures is that of
world-set descriptor sets, or ws-sets for short. We
formally introduce a probabilistic database model that
is known to cleanly and directly generalize many previously
considered probabilistic database models (cf.
[3]) including, among others, various forms of tuple-independence
models [9], [2], ULDBs [5], product decomposition
[1], and c-table-based models [3]. We use
this framework to study exact confidence computation
and conditioning. The results obtained are thus of immediate
relevance to all these models.

• We study properties of ws-sets that are essential to 
relational algebra query evaluation and to the design 
of algorithms for the two main problems of the paper. 

• We exhibit the fundamental, close relationship between 
the two problems. 

• We develop ws-trees, which capture notions of struc- 
tural decomposition of ws-sets based on probabilistic 
independence and world-set disjointness. Once a ws- 
tree has been obtained for a given ws-set, both exact 
confidence computation and conditioning are feasible 
in linear time. The main problem is thus to efficiently 
find small ws-tree decompositions. 

Figure 2: Probabilistic database with ws-descriptors
made explicit and defined by world-table W.



• To this end, we develop a decomposition procedure
motivated by the Davis-Putnam (DP) procedure for
checking Propositional Satisfiability [12]. DP, while
many decades old, is still the basis of the best exact
solution techniques for the NP-complete Satisfiability
problem. We introduce two decomposition rules, variable
elimination (the main rule of DP) and a new independence
decomposition rule, and develop heuristics
for choosing among the rules.

• We develop a database conditioning algorithm based 
on ws-tree decompositions and prove its correctness. 

• We study ws-set simplification and elimination tech- 
niques that can be either used as an alternative to the 
DP-based procedure or combined with it. 

• We provide a thorough experimental evaluation of the 
algorithms presented in this paper. We also experi- 
mentally compare our exact techniques for confidence 
computation with approximation based on Monte Carlo 
simulation. 

The structure of the paper follows the list of contributions. 

2. PROBABILISTIC DATABASES 

We define sets of possible worlds following U-relational
databases [3]. Consider a finite set of independent random
variables ranging over finite domains. Probability distributions
over the possible worlds are defined by assigning a
probability $P(\{x \mapsto i\})$ to each assignment of a variable $x$ to
a constant of its domain, $i \in \mathrm{Dom}_x$, such that the probabilities
of all assignments of a given variable sum up to one. We
represent the set of variables, their domains, and probability
distributions relationally by a world-table W consisting
of all triples $(x, i, p)$ of variables $x$, values $i$ in the domain of
$x$, and the associated probabilities $p = P(\{x \mapsto i\})$.

A world-set descriptor is a set of assignments $x \mapsto i$ with
$i \in \mathrm{Dom}_x$ that is functional, i.e. a partial function from variables
to domain values. If such a world-set descriptor $d$ is a
total function, then it identifies a possible world. Otherwise,
it denotes all those possible worlds $\omega(d)$ identified by total
functions $f$ that can be obtained by extension of $d$. (That
is, for all $x$ on which $d$ is defined, $d(x) = f(x)$.) Because of
the independence of the variables, the aggregate probability
of these worlds is

$$P(d) = \prod_{\{x \mapsto i\} \subseteq d} P(\{x \mapsto i\}).$$

If $d = \emptyset$, then $d$ denotes the set of all possible worlds.
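
The product formula can be sketched in a few lines. This is an illustrative sketch only: the world-table below uses the John/Bill numbers from the introduction, the variable names `j` and `b` and the function names are assumptions, and a ws-descriptor is modelled as a plain Python dict.

```python
from math import prod

# World-table W as triples (variable, value, probability). The values are the
# John/Bill alternatives from the introduction; "j" and "b" are assumed names.
W = [("j", 1, 0.2), ("j", 7, 0.8), ("b", 4, 0.3), ("b", 7, 0.7)]

def check_world_table(w):
    """The probabilities of all assignments of a variable must sum to one."""
    totals = {}
    for var, _, p in w:
        totals[var] = totals.get(var, 0.0) + p
    return all(abs(t - 1.0) < 1e-9 for t in totals.values())

def descriptor_prob(d, w):
    """P(d): product of P({x -> i}) over the assignments in d (a dict)."""
    table = {(var, val): p for var, val, p in w}
    return prod(table[(x, i)] for x, i in d.items())
```

For the empty descriptor the product is empty and evaluates to 1, matching the statement that it denotes the set of all possible worlds.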

We say that two ws-descriptors $d_1$ and $d_2$ are consistent
iff their union (as sets of assignments) is functional.

A ws-set is a set of ws-descriptors $S$ and represents the
world-set computed as the union of the world-sets represented
by the ws-descriptors in the set. We define the semantics
of ws-sets using the (herewith overloaded) function
$\omega$ extended to ws-sets,

$$\omega(S) := \bigcup_{d \in S} \omega(d).$$

A U-relation over schema $\Sigma$ and world-table W is a set
of tuples over $\Sigma$, where we associate to each tuple a ws-descriptor
over W. A probabilistic database over schema
$\{\Sigma_1, \ldots, \Sigma_n\}$ and world-table W is a set of $n$ U-relations,
each over one schema $\Sigma_i$ and W. A probabilistic database
represents a set of databases, one database for each possible
world defined by W. To obtain a possible world in the represented
set, we first choose a total valuation $f$ over W. We
then process each probabilistic relation $R_i$ tuple by tuple.
If $f$ extends the ws-descriptor $d$ of a tuple $t$, then $t$ is in the
relation $R_i$ of that database.

Example 2.1. Consider again the probabilistic database
of social security numbers and names given in Figure 1. Its
representation in our formalism is given in Figure 2. The
world-table W of Figure 2 defines two variables $j$ and $b$
modeling the social security numbers of John and Bill, with
domains {1, 7} and {4, 7} respectively. The probability of
the world defined by $f = \{j \mapsto 7, b \mapsto 7\}$ is $.8 \cdot .7 = .56$. The
total valuation $f$ extends the ws-descriptors of the second
and fourth tuple of relation $U_R$, thus the relation $R$ in world
$f$ is { (7, John), (7, Bill) }. □
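
The instantiation step of Example 2.1 can be sketched as follows. The contents of `U_R` are an assumed reconstruction of the U-relation of Figure 2 from the description in the text (one tuple per alternative, in an order that makes the second and fourth tuples the ones selected by $f$); the function names are hypothetical.

```python
# Assumed reconstruction of U_R: (ws-descriptor, tuple) pairs.
U_R = [
    ({"j": 1}, (1, "John")),
    ({"j": 7}, (7, "John")),
    ({"b": 4}, (4, "Bill")),
    ({"b": 7}, (7, "Bill")),
]

def extends(f, d):
    """True if the total valuation f agrees with d wherever d is defined."""
    return all(f.get(x) == i for x, i in d.items())

def instantiate(u_relation, f):
    """Relation in the possible world identified by the total valuation f."""
    return {t for d, t in u_relation if extends(f, d)}

world = instantiate(U_R, {"j": 7, "b": 7})
```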

Remark 2.2. Leaving aside the probability distributions 
of the variables which are represented by the W table, U- 
relations are essentially restricted c-tables [16] in which the 
global condition is "true", variables must not occur in the 
tuples, and each local condition must be a conjunction of 
conditions of the form x = v where x is a variable and v is 
a constant. Nevertheless, it is known that U-relations are a 
complete representation system for probabilistic databases 
over nonempty finite sets of possible worlds. 

U-relations can be used to represent attribute-level uncertainty
using vertically decomposed relations. For details on
this, we refer to [3]. All results in this paper work in the
context of attribute-level uncertainty.

The efficient execution of the operations of positive relational
algebra on such databases was described in that paper
as well. Briefly, if U-relations $U_R$ and $U_S$ represent relations
$R$ and $S$, then selections $\sigma_\phi R$ and projections $\pi_{\vec{A}} R$ simply
translate into $\sigma_\phi U_R$ and $\pi_{\mathit{WSD}, \vec{A}} U_R$, respectively. Joins
$R \bowtie_\phi S$ translate into $U_R \bowtie_{\phi \wedge \psi} U_S$ where $\psi$ is the condition
that the ws-descriptors of the two tuples compared are
consistent with each other (i.e., have a common extension
into a total valuation). The set operations easily follow from
the analogous operations on ws-sets that will be described
below, in Section 3.2. □

Example 2.3. The functional dependency SSN → NAME
on the probabilistic database of Figure 2 can be expressed
as a boolean relational algebra query as the complement of
$\pi_\emptyset(R \bowtie_\phi R)$ where $\phi := (1.\mathit{SSN} = 2.\mathit{SSN} \wedge 1.\mathit{NAME} \neq 2.\mathit{NAME})$.
We turn this into the query

$$\pi_{\mathit{WSD}}(U_R \bowtie_{\phi \,\wedge\, 1.\mathit{WSD} \text{ consistent with } 2.\mathit{WSD}} U_R)$$

over our representation, which results in the ws-set
$\{\{j \mapsto 7, b \mapsto 7\}\}$. The complement of this with the world-set given
by the W relation, $\{\{j \mapsto 1\}, \{j \mapsto 7\}, \{b \mapsto 4\}, \{b \mapsto 7\}\}$,
is $\{\{j \mapsto 1\}, \{j \mapsto 7, b \mapsto 4\}\}$. (Note that this is just one
among a set of equivalent solutions.) □



3. PROPERTIES OF WS-DESCRIPTORS 

In this section we investigate properties of ws-descriptors
and show how they can be used to efficiently implement
various set operations on world-sets without having to enumerate
the worlds. This is important, for such sets can be
extremely large in practice: [1, 3] report on experiments
with $10^{10^6}$ worlds.

3.1 Mutex, Independence, and Containment 

Two ws-descriptors $d_1$ and $d_2$ are (1) mutually exclusive
(mutex for short) if they represent disjoint world-sets, i.e.,
$\omega(d_1) \cap \omega(d_2) = \emptyset$, and (2) independent if there is no valuation
of the variables in one of the ws-descriptors that restricts
the set of possible valuations of the variables in the
other ws-descriptor (that is, $d_1$ and $d_2$ are defined only on
disjoint sets of variables). A ws-descriptor $d_1$ is contained
in $d_2$ if the world-set of $d_1$ is contained in the world-set of
$d_2$, i.e., $\omega(d_1) \subseteq \omega(d_2)$. Equivalence is mutual containment.

Although ws-descriptors represent very succinctly possibly
very large world-sets, all aforementioned properties can
be efficiently checked at the syntactical level: $d_1$ and $d_2$,
where all variables with singleton domains are eliminated,
are (1) mutex if there is a variable with a different assignment
in each of them, and (2) independent if they have no
variables in common; $d_1$ is contained in $d_2$ if $d_1$ extends $d_2$.

Example 3.1. Consider the world-table of Figure 2 and
the ws-descriptors $d_1 = \{j \mapsto 1\}$, $d_2 = \{j \mapsto 7\}$,
$d_3 = \{j \mapsto 1, b \mapsto 4\}$, and $d_4 = \{b \mapsto 4\}$. Then, the pairs $(d_1, d_2)$
and $(d_2, d_3)$ are mutex, $d_3$ is contained in $d_1$, and the pairs
$(d_1, d_4)$ and $(d_2, d_4)$ are independent. □
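
The syntactic checks are short enough to state directly. The following is a minimal sketch, assuming ws-descriptors are Python dicts from variable names to values and that variables with singleton domains have already been eliminated; the function names are hypothetical.

```python
def is_mutex(d1, d2):
    """Some shared variable is assigned different values in d1 and d2."""
    return any(x in d2 and d2[x] != i for x, i in d1.items())

def is_independent(d1, d2):
    """d1 and d2 have no variables in common."""
    return not (d1.keys() & d2.keys())

def is_contained(d1, d2):
    """omega(d1) is a subset of omega(d2) iff d1 extends d2."""
    return all(x in d1 and d1[x] == i for x, i in d2.items())

def is_consistent(d1, d2):
    """The union of the two assignment sets is functional."""
    return not is_mutex(d1, d2)
```

Each check touches only the assignments of the two descriptors, so none of them needs to enumerate worlds.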

We also consider the mutex, independence, and equivalence
properties for ws-sets. Two ws-sets $S_1$ and $S_2$ are
mutex (independent) iff $d_1$ and $d_2$ are mutex (independent)
for any $d_1 \in S_1$ and $d_2 \in S_2$. Two ws-sets are equivalent if
they represent the same world-set.

Example 3.2. We continue Example 3.1. The ws-set
$\{d_1\}$ is mutex with $\{d_2\}$. $\{d_1, d_2\}$ is independent from $\{d_4\}$.
At first glance, it looks like $\{d_1, d_2\}$ and $\{d_3, d_4\}$ are neither
mutex nor independent, because $d_1$ and $d_3$ overlap.
However, we note that $d_3 \subseteq d_4$ and then $\omega(\{d_3, d_4\}) =
\omega(\{d_4\})$ and $\{d_4\}$ is independent from $\{d_1, d_2\}$. □

3.2 Set Operations on ws-sets 

Various relevant computation tasks, ranging from decision
procedures like tuple possibility [1] to confidence computation
of answer tuples, and conditioning of probabilistic
databases, require symbolic manipulations of ws-sets. For
example, checking whether two tuples of a probabilistic re- 
lation can co-occur in some worlds can be done by intersect- 
ing their ws-descriptors; both tuples co-occur in the worlds 
defined by the intersection of the corresponding world-sets. 

We next define set operations on ws-sets. 

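• Intersection. $\mathrm{Intersect}(S_1, S_2) :=
\{d_1 \cup d_2 \mid d_1 \in S_1, d_2 \in S_2, d_1 \text{ is consistent with } d_2\}$,
where $d_1 \cup d_2$ is the union of the two sets of assignments.

• Union. $\mathrm{Union}(S_1, S_2) := S_1 \cup S_2$.

• Difference. The definition is inductive, starting with
singleton ws-sets. If ws-descriptors $d_1$ and $d_2$ are inconsistent,
$\mathrm{Diff}(\{d_1\}, \{d_2\}) := \{d_1\}$. Otherwise,

$$\mathrm{Diff}(\{d_1\}, \{d_2\}) := \{d_1 \cup \{x_1 \mapsto w_1, \ldots, x_{i-1} \mapsto w_{i-1}, x_i \mapsto w'_i\} \mid
d_2 - d_1 = \{x_1 \mapsto w_1, \ldots, x_k \mapsto w_k\},\ 1 \le i \le k,\ w'_i \in \mathrm{dom}_{x_i},\ w_i \neq w'_i\}.$$

$$\mathrm{Diff}(\{d_1\}, S \cup \{d_2\}) := \mathrm{Diff}(\mathrm{Diff}(\{d_1\}, S), \{d_2\}).$$

$$\mathrm{Diff}(\{d_1, \ldots, d_n\}, S) := \bigcup_{1 \le i \le n} \mathrm{Diff}(\{d_i\}, S).$$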

Example 3.3. Consider $d_1 = \{j \mapsto 1\}$, $d_2 = \{j \mapsto 7\}$,
and $d_3 = \{j \mapsto 1, b \mapsto 4\}$. Then, $\mathrm{Intersect}(\{d_1\}, \{d_2\}) =
\mathrm{Intersect}(\{d_2\}, \{d_3\}) = \emptyset$ because $d_2$ is inconsistent with $d_1$
and $d_3$. $\mathrm{Intersect}(\{d_1\}, \{d_3\}) = \{d_3\}$, because $d_3$ is contained
in $d_1$. $\mathrm{Diff}(\{d_2\}, \{d_1\}) = \mathrm{Diff}(\{d_2\}, \{d_3\}) = \{d_2\}$ because
$d_2$ is mutex with $d_1$ and $d_3$. $\mathrm{Diff}(\{d_1\}, \{d_3\}) = \{\{j \mapsto 1, b \mapsto 7\}\}$.
$\mathrm{Diff}(\{d_3\}, \{d_2\}) = \{d_3\}$, because $d_3$ and $d_2$ are
inconsistent. □

Proposition 3.4. The above definitions of set operations
on ws-sets are correct:

1. $\omega(\mathrm{Union}(S_1, S_2)) = \omega(S_1) \cup \omega(S_2)$.

2. $\omega(\mathrm{Intersect}(S_1, S_2)) = \omega(S_1) \cap \omega(S_2)$.

3. $\omega(\mathrm{Diff}(S_1, S_2)) = \omega(S_1) - \omega(S_2)$.

The ws-descriptors in $\mathrm{Diff}(S_1, S_2)$ are pairwise mutex.
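
The three operations can be sketched as follows. This is an illustrative sketch, not the authors' implementation: ws-descriptors are modelled as frozensets of (variable, value) pairs, the domain map `dom` and the helper names (`desc`, `diff_single`) are assumptions, and the order in which the assignments of $d_2 - d_1$ are enumerated is fixed arbitrarily by sorting.

```python
def desc(**assignments):
    """Build a ws-descriptor, e.g. desc(j=1, b=4)."""
    return frozenset(assignments.items())

def consistent(d1, d2):
    """The union of the two assignment sets is functional."""
    m = dict(d1)
    return all(m.get(x, i) == i for x, i in d2)

def intersect(s1, s2):
    return {d1 | d2 for d1 in s1 for d2 in s2 if consistent(d1, d2)}

def union(s1, s2):
    return set(s1) | set(s2)

def diff_single(d1, d2, dom):
    """Diff({d1}, {d2}) following the inductive definition above."""
    if not consistent(d1, d2):
        return {d1}
    result, prefix = set(), d1
    for x, w in sorted(d2 - d1):          # x_1 -> w_1, ..., x_k -> w_k
        for w_alt in dom[x]:
            if w_alt != w:                # x_i -> w'_i with w'_i != w_i
                result.add(prefix | {(x, w_alt)})
        prefix = prefix | {(x, w)}        # keep x_1..x_i as in d2 from now on
    return result

def diff(s1, s2, dom):
    result = set()
    for d1 in s1:
        current = {d1}
        for d2 in s2:                     # Diff({d1}, S u {d2}) step by step
            current = {d for c in current for d in diff_single(c, d2, dom)}
        result |= current
    return result
```

When $d_1$ is contained in $d_2$, the set $d_2 - d_1$ is empty and `diff_single` returns the empty ws-set, as expected.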

4. WORLD-SET TREES 

The ws-sets have important properties, like succinctness, 
closure under set operations, and natural relational encod- 
ing, and [3] employed them to achieve the purely relational 
processing of positive relational algebra on U-relational da- 
tabases. When it comes to the manipulation of probabilities 
of query answers or of worlds violating given constraints, 
however, ws-sets are in most cases inadequate. This is be- 
cause ws-descriptors in a ws-set may represent non-disjoint 
world-sets, and for most manipulations of probabilities a 
substantial computational effort is needed to identify com- 
mon world-subsets across possibly many ws-descriptors. 

We next introduce a new compact representation of world-sets,
called world-set tree representation, or ws-tree for short,
that makes the structure in the ws-sets explicit. This representation
formalism allows for efficient exact probability
computation and conditioning and has strong connections
to knowledge compilation, as it is used in system modelling
and verification [11]. There, too, various kinds of decision
diagrams, like binary decision diagrams (BDDs) [7], are employed
for the efficient manipulation of propositional formulas.

Definition 4.1. Given a world-table W, a ws-tree over
W is a tree with inner nodes $\otimes$ and $\oplus$, leaves holding the
ws-descriptor $\emptyset$, and edges annotated with weighted variable
assignments consistent with W. The following constraints
hold for a ws-tree:

• A variable defined in W occurs at most once on each
root-to-leaf path.

• Each of its $\oplus$-nodes is associated with a variable $v$ such
that each outgoing edge is annotated with a different
assignment of $v$.

• The sets of variables occurring in the subtrees rooted
at the children of any $\otimes$-node are disjoint. □

Figure 3: World-set table W, a ws-tree R over W, and an equivalent ws-set S.

We define the semantics of ws-trees in strict analogy to
that of ws-sets based on the observation that the set of edge
annotations on each root-to-leaf path in a ws-tree represents
a ws-descriptor. The world-set represented by a ws-tree is
precisely represented by the ws-set consisting of the annotation
sets of all root-to-leaf paths. The inner nodes have
a special semantics: the children of a $\otimes$-node use disjoint
variable sets and are thus independent, and the children of
a $\oplus$-node follow branches with different assignments of the
same variable and are thus mutually exclusive.

Example 4.2. Figure 3 shows a ws-tree and the ws-set
consisting of all its root-to-leaf paths. □

4.1 Constructing world-set trees 

The key idea underlying our translation of ws-sets into 
ws-trees is a divide-and-conquer approach that exploits the 
relationships between ws-descriptors, like independence and 
variable sharing. 

Figure 4 gives our translation algorithm. We proceed recursively
by partitioning the ws-sets into independent disjoint
partitions (when possible) or into (possibly overlapping)
partitions that are consistent with different assignments
of a variable. In the case of independent partitioning,
we create $\otimes$-nodes whose children are the translations of the
independent partitions. In the second case, we simplify the
problem by eliminating a variable: we choose a variable $x$
and create an $\oplus$-node whose outgoing edges are annotated
with different assignments $x \mapsto i$ of $x$ and whose children
are translations of the subsets of the ws-set consisting of ws-descriptors
consistent with $x \mapsto i$.¹ If at any recursion step
the input ws-set contains the nullary ws-descriptor, which
by definition represents the whole world-set, then we stop
the recursion and create a ws-tree leaf $\emptyset$. This can happen
after several variable elimination steps that reduced some of
the input ws-descriptors to $\emptyset$.



ComputeTree (WS-Set S) returns WS-Tree
  if ($S = \emptyset$) then return $\bot$
  else if ($\emptyset \in S$)  // S contains a universal ws-desc
    then return $\emptyset$
  else choose one of the following:
    (independent partitioning)
      if there are non-empty and independent ws-sets
        $S_1, \ldots, S_{|I|}$ such that $S = S_1 \cup \cdots \cup S_{|I|}$
      then return $\bigotimes_{i \in I}$ (ComputeTree($S_i$))
    (variable elimination)
      choose a variable $x$ in $S$;
      $T := \{d \mid d \in S, \nexists i \in \mathrm{dom}_x : \{x \mapsto i\} \subseteq d\}$;
      $\forall i \in \mathrm{dom}_x : S_{x \mapsto i} := \{\{y_1 \mapsto j_1, \ldots, y_m \mapsto j_m\} \mid
          \{x \mapsto i, y_1 \mapsto j_1, \ldots, y_m \mapsto j_m\} \in S\}$;
      return $\bigoplus_{i \in \mathrm{dom}_x}$ ($x \mapsto i$ : ComputeTree($S_{x \mapsto i} \cup T$))

Figure 4: Translating ws-sets into ws-trees.

¹ Our translation abstracts out implementation details. For
instance, for those assignments of $x$ that do not occur in $S$
we have $T \cup S_{x \mapsto i} = T$ and can translate $T$ only once.
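
The translation can be sketched in Python as follows. This is a simplified sketch under stated assumptions, not the authors' implementation: ws-descriptors are frozensets of (variable, value) pairs, trees are nested tuples with the hypothetical tags `"ind"` and `"mutex"`, the variable to eliminate is chosen naively (alphabetically smallest) instead of by a heuristic, independent partitioning is always preferred when applicable, and the domains of the example ws-set are assumed.

```python
def components(s):
    """Partition a ws-set into groups of descriptors connected by shared variables."""
    groups = []                                   # (variables, descriptors)
    for d in s:
        d_vars, merged, rest = {x for x, _ in d}, [d], []
        for g_vars, g_descs in groups:
            if g_vars & d_vars:
                d_vars |= g_vars
                merged.extend(g_descs)
            else:
                rest.append((g_vars, g_descs))
        groups = rest + [(d_vars, merged)]
    return [set(descs) for _, descs in groups]

def compute_tree(s, dom):
    s = set(s)
    if not s:
        return "BOTTOM"
    if frozenset() in s:                          # universal ws-descriptor
        return "LEAF"
    parts = components(s)
    if len(parts) > 1:                            # independent partitioning
        return ("ind", [compute_tree(p, dom) for p in parts])
    x = min(v for d in s for v, _ in d)           # naive variable choice
    t = {d for d in s if all(v != x for v, _ in d)}
    branches = {}
    for i in dom[x]:                              # variable elimination
        s_xi = {d - {(x, i)} for d in s if (x, i) in d}
        sub = compute_tree(s_xi | t, dom)
        if sub != "BOTTOM":                       # empty branches are omitted
            branches[i] = sub
    return ("mutex", x, branches)

def paths(tree):
    """ws-set formed by the annotation sets of all root-to-leaf paths."""
    if tree == "BOTTOM":
        return set()
    if tree == "LEAF":
        return {frozenset()}
    if tree[0] == "ind":
        return set().union(*(paths(c) for c in tree[1]))
    _, x, branches = tree
    return {p | {(x, i)} for i, sub in branches.items() for p in paths(sub)}

# ws-set with the shape described for S in the text; domains are assumed.
dom = {"x": [1, 2, 3], "y": [1, 2], "z": [1, 2], "u": [1, 2], "v": [1, 2]}
S = {frozenset({("x", 1)}), frozenset({("x", 2), ("y", 1)}),
     frozenset({("x", 2), ("z", 1)}), frozenset({("u", 1), ("v", 1)}),
     frozenset({("u", 2)})}
tree = compute_tree(S, dom)
```

Unlike the footnote's suggestion, this sketch retranslates $T$ for every missing assignment; sharing that work is left out for brevity.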



Example 4.3. We show how to translate the ws-set $S$
into the ws-tree $R$ (Figure 3). We first partition $S$ into
two (minimally) independent ws-sets $S_1$ and $S_2$: $S_1$ consists
of the first three ws-descriptors of $S$, and $S_2$ consists of the
remaining two. For $S_1$, we can eliminate any of the variables
$x$, $y$, or $z$. Suppose we choose $x$ and create two branches
for $x \mapsto 1$ and $x \mapsto 2$ respectively (there is no ws-descriptor
consistent with $x \mapsto 3$). For the first branch, we stop with
the ws-set $\{\emptyset\}$, whereas for the second branch we continue
with the ws-set $\{\{y \mapsto 1\}, \{z \mapsto 1\}\}$. The latter ws-set can
be partitioned into independent subsets in the context of the
assignment $x \mapsto 2$. We proceed similarly for $S_2$ and choose
to eliminate variable $u$. We create an $\oplus$-node with outgoing
edges for assignments $u \mapsto 1$ and $u \mapsto 2$ respectively. We
are left in the former case with the ws-set $\{\{v \mapsto 1\}\}$ and in
the latter case with $\{\emptyset\}$.

Different variable choices can lead to different ws-trees.
This is the so-called variable ordering problem that applies
to the construction of binary decision diagrams. Later in
this section we discuss heuristics for variable orderings. □

Figure 5: A ws-tree equivalent to R of Figure 3.

Estimate (WS-Set S, variable x in S) returns Real
  missing_assignment := false;
  foreach $i \in \mathrm{dom}_x$ do
    compute $S_{x \mapsto i}$ and $T$ as shown in Figure 4
    if $|S_{x \mapsto i}| > 0$ then $s_i = |S_{x \mapsto i} \cup T|$
    else $s_i = 0$; missing_assignment = true; endif
  if (missing_assignment) then $e = |T|$ else $e = 0$;
  foreach $j \in \mathrm{dom}_x$ such that $s_j > 0$ do
    $e = e + \log_k(1 + k^{s_j - e})$
  return e

Figure 6: Log cost estimate for a variable choice.

Theorem 4.4. Given a ws-set S , ComputeTree(S) and S 
represent the same world-set. 

Our translation can yield ws-trees of exponential size (similar
to BDDs). This rather high worst-case complexity needs
to be paid for efficient exact probability computation and
conditioning. It is known that counting models of propositional
formulas and exact probability computation are #P-hard
problems [9]. This complexity result does not preclude,
however, BDDs from being very successful in practice. We
expect the same for ws-trees. The key observation for a good
behaviour in practice is that we should partition ws-sets into
independent subsets whenever possible and we should carefully
choose a good ordering for variable eliminations. Both
methods greatly influence the size of the ws-trees and the
translation time, as shown in the next example.

Example 4.5. Consider again the ws-set $S$ of Figure 3
and a different ordering for variable eliminations that leads
to the ws-tree of Figure 5. We briefly discuss the construction
of this ws-tree. Assume we choose to eliminate the
variable $y$ and obtain the ws-sets

$$S_{y \mapsto 2} = \{\{x \mapsto 1\}, \{x \mapsto 2, z \mapsto 1\}, \{u \mapsto 1, v \mapsto 1\}, \{u \mapsto 2\}\}$$

$$S_{y \mapsto 1} = S_{y \mapsto 2} \cup \{\{x \mapsto 2\}\}$$

In contrast to the computation of the ws-tree $R$ of Figure 3,
our variable choice creates intermediary ws-sets that overlap
to a large extent, which ultimately leads to a large increase in the size
of the ws-tree. This bad choice need not necessarily lead to
redundant computation, which we could easily detect. In
fact, the only major savings in case we detect and eliminate
redundancy here are the subtrees $\alpha$ and $\beta$, which still leave
a graph larger than $R$. □

4.2 Heuristics 

We next study heuristics for variable elimination and independent
partitioning that are compared experimentally in
Section 7. We devise a simple cost estimate, which we use
to decide at each step whether to partition or which variable
to eliminate. We assume that, in the worst case, the cost
of translating a ws-set $S$ is $2^{|S|}$ (following the exponential
formula of the inclusion-exclusion principle).

In case of independent partitioning, the partitions $S_1, \ldots, S_n$
are disjoint and can be computed in polynomial time (by
computing the connected components of the graph of variables
co-occurring within ws-descriptors). We thus reduce
the computation cost from $2^{|S|}$ to $2^{|S_1|} + \cdots + 2^{|S_n|}$. This
method is, however, not always applicable, in which case we
need to apply variable elimination.
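
The connected-components step and the resulting cost reduction can be illustrated with a small union-find sketch. The function names are hypothetical, ws-descriptors are frozensets of (variable, value) pairs, and the example ws-set has the shape described for $S$ of Figure 3 in the text.

```python
def independent_partitions(ws_set):
    """Group descriptors by the connected components of co-occurring variables."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]     # path halving
            v = parent[v]
        return v

    for d in ws_set:
        variables = [x for x, _ in d]
        for v in variables:
            parent[find(v)] = find(variables[0])   # link all variables of d
    groups = {}
    for d in ws_set:
        # The empty descriptor mentions no variable and forms its own group.
        key = find(next(iter(d))[0]) if d else None
        groups.setdefault(key, set()).add(d)
    return list(groups.values())

def worst_case_cost(partitions):
    """Sum of 2^|S_i| over the given partitions."""
    return sum(2 ** len(p) for p in partitions)

S = {frozenset({("x", 1)}), frozenset({("x", 2), ("y", 1)}),
     frozenset({("x", 2), ("z", 1)}), frozenset({("u", 1), ("v", 1)}),
     frozenset({("u", 2)})}
parts = independent_partitions(S)
```

For this ws-set the estimate drops from $2^5 = 32$ for the unpartitioned set to $2^3 + 2^2 = 12$ for the two partitions.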

The main advantage of variable elimination is that $S$ is
divided into subsets $T \cup S_{x \mapsto i}$ without the dependencies enforced
by variable $x$ and thus subject to independent partitioning
in the context of $x \mapsto i$. Let $s_i$ be the size of the
ws-set $T \cup S_{x \mapsto i}$. Then, the cost of choosing $x$ is
$\sum_{i \in \mathrm{dom}_x} 2^{s_i}$.
Of course, for those assignments of $x$ that do not occur in
$S$ we have $T \cup S_{x \mapsto i} = T$ and can translate $T$ only once.
The computation cost using variable elimination can match
that of independent partitioning only in the case that the
assignments of the chosen variable partition the input ws-set
$S$ and thus $T$ is empty.

Our first heuristic, called minlog, chooses a variable that
minimizes $\log\big(\sum_{i=1}^{|\mathrm{dom}_x|} 2^{s_i}\big)$. Figure 6 shows how to compute
the cost estimate incrementally, avoiding summation of
potentially large numbers. The variable missing_assignment
is used to detect whether there is at least one assignment of
$x$ not occurring in $S$ for which $T$ will be translated; in this
case, $T$ is only translated once (and not for every missing
assignment).

The second heuristic, called minmax, approximates the
cost estimate and chooses a variable that minimizes the maximal
ws-set $T \cup S_{x \mapsto i}$. Both heuristics need time linear in the
sizes of all variable domains plus the size of the ws-set. In addition
to the work done by minmax, minlog needs to perform log and exp operations.
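
Both estimates can be sketched as follows. This is an illustrative sketch, assuming ws-descriptors are frozensets of (variable, value) pairs; the function names are hypothetical. The minlog update follows the recurrence of Figure 6 but is written in the algebraically equivalent form max + log(1 + k^(min - max)), which avoids large exponents; that rearrangement is a choice made here, not something stated in the text.

```python
from math import log

def split(s, x, dom_x):
    """Return T and the map i -> S_{x->i}, following the definitions of Figure 4."""
    t = {d for d in s if all(v != x for v, _ in d)}
    by_value = {i: {d - {(x, i)} for d in s if (x, i) in d} for i in dom_x}
    return t, by_value

def minlog(s, x, dom_x, k=2.0):
    """Incremental log-cost estimate for eliminating x."""
    t, by_value = split(s, x, dom_x)
    sizes = [len(s_xi | t) for s_xi in by_value.values() if s_xi]
    missing = any(not s_xi for s_xi in by_value.values())
    # T is counted once for all missing assignments; otherwise e starts at 0,
    # as in Figure 6.
    e = float(len(t)) if missing else 0.0
    for s_i in sizes:
        hi, lo = max(e, s_i), min(e, s_i)
        e = hi + log(1.0 + k ** (lo - hi), k)   # = log_k(k^e + k^s_i)
    return e

def minmax(s, x, dom_x):
    """Size of the largest ws-set T u S_{x->i} over all assignments of x."""
    t, by_value = split(s, x, dom_x)
    return max(len(s_xi | t) for s_xi in by_value.values())

S = {frozenset({("x", 1)}), frozenset({("x", 2), ("y", 1)}),
     frozenset({("x", 2), ("z", 1)}), frozenset({("u", 1), ("v", 1)}),
     frozenset({("u", 2)})}
```

On this ws-set (the shape described for $S$ of Figure 3, with assumed domains {1, 2, 3} for $x$ and {1, 2} for $y$), both estimates rank $x$ ahead of $y$.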

Remark 4.6. To better understand our heuristics, we
give one scenario where minmax behaves suboptimally. Consider
$S$ of size $n$ and two variables. Variable $x$ occurs with
the same assignment in $n - 1$ ws-descriptors and thus its minmax
estimate is $n$, and variable $y$ occurs twice with different
assignments, and thus its minmax estimate is $n - 1$. Using
minmax, we choose $y$, although minlog would choose differently:
$e(y) = \log(2 \cdot 2^{n-1} + 2^{n-2}) > \log(2 \cdot 2^{n-1}) = e(x)$. □

4.3 Probability computation 

We next give an algorithm for computing the exact probability
of a ws-set by employing the translation of ws-sets into
ws-trees discussed in Section 4.1. Figure 7 defines the function
$P$ to this effect. This function is defined using pattern
matching on the node types of ws-trees. The probability of
an $\otimes$-node is the joint probability of its independent children
$S_1, \ldots, S_{|I|}$, that is, $1 - \prod_{i \in I} (1 - P(S_i))$. The probability
of an $\oplus$-node is the joint probability of its mutually exclusive
children, where the probability of each child $S_i$ is weighted by
the probability of the variable assignment $x \mapsto i$ annotating the
incoming edge of $S_i$, that is, $\sum_{i} P(\{x \mapsto i\}) \cdot P(S_i)$.
Finally, the probability of a leaf represented by the
nullary ws-descriptor is 1 and of $\bot$ is 0.

Figure 7: Probability computation for ws-trees.

Example 4.7. The probability of the ws-tree R of Figure 3
can be computed as follows (we label the inner nodes
with l for left child and r for right child):

\begin{align*}
P(R) &= 1 - (1 - P(l)) \cdot (1 - P(r))\\
P(l) &= P(\{x \mapsto 1\}) \cdot P(\emptyset) + P(\{x \mapsto 2\}) \cdot P(lr)\\
P(lr) &= 1 - (1 - P(\{y \mapsto 1\}) \cdot P(\emptyset)) \cdot (1 - P(\{z \mapsto 1\}) \cdot P(\emptyset))\\
P(r) &= P(\{u \mapsto 1\}) \cdot P(\{v \mapsto 1\}) \cdot P(\emptyset) + P(\{u \mapsto 2\}) \cdot P(\emptyset)
\end{align*}

We can now replace the probabilities for variable assignments
and ws-descriptors and obtain

\begin{align*}
P(r) &= 0.7 \cdot 0.5 \cdot 1 + 0.3 = 0.65\\
P(lr) &= 1 - (1 - 0.2 \cdot 1) \cdot (1 - 0.4 \cdot 1) = 0.52\\
P(l) &= 0.1 \cdot 1 + 0.4 \cdot 0.52 = 0.308\\
P(R) &= 1 - (1 - 0.308) \cdot (1 - 0.65) = 0.7578
\end{align*}
□
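The computation of Example 4.7 can be replayed with a small Python sketch of the function P. The tuple-based encoding of ws-trees and the dictionary `W` are assumptions made for illustration; only the weights that the example uses are listed, so `W` is not a complete world table.

```python
from math import prod

LEAF = ("leaf",)      # leaf for the nullary ws-descriptor, probability 1
BOTTOM = ("bottom",)  # leaf for the empty world-set, probability 0

def ind(*children):
    """Node whose children are independent ws-trees."""
    return ("ind", list(children))

def mutex(var, branches):
    """Node eliminating var; branches maps each value i to the subtree under var -> i."""
    return ("mutex", var, branches)

def P(tree, W):
    """Probability of a ws-tree, by pattern matching on the node type."""
    kind = tree[0]
    if kind == "leaf":
        return 1.0
    if kind == "bottom":
        return 0.0
    if kind == "ind":
        return 1.0 - prod(1.0 - P(child, W) for child in tree[1])
    _, var, branches = tree
    return sum(W[var][value] * P(child, W) for value, child in branches.items())

# Weights used in Example 4.7 (other alternatives omitted).
W = {"x": {1: 0.1, 2: 0.4}, "y": {1: 0.2}, "z": {1: 0.4},
     "u": {1: 0.7, 2: 0.3}, "v": {1: 0.5}}

lr = ind(mutex("y", {1: LEAF}), mutex("z", {1: LEAF}))
l = mutex("x", {1: LEAF, 2: lr})
r = mutex("u", {1: mutex("v", {1: LEAF}), 2: LEAF})
R = ind(l, r)
print(P(R, W))  # close to 0.7578, up to floating-point rounding
```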

The probability of a ws-tree R can be computed in one
bottom-up traversal of R and does not require the
precomputation of R. The translation and probability computation
functions can be easily composed to obtain the function
ComputeTree $\circ$ P by inlining P in ComputeTree. As a
result, the construction of the nodes $\oplus$, $\otimes$, and $\emptyset$ is replaced
by the corresponding probability computation given by P.

5. CONDITIONING 

In this section we study the problem of conditioning a 
probabilistic database, i.e., the problem of removing all pos- 
sible worlds that do not satisfy a given condition (say, by 
a Boolean relational calculus query) and renormalizing the 
database such that, if there is at least one world left, the 
probability weights of all worlds sum up to one. 

We will think of conditioning as a query or update operation
$\mathrm{assert}_\phi$, where $\phi$ is the condition, i.e., a Boolean
query. Processing relational algebra queries on probabilistic
databases was discussed in Section 2. We will now assume
the result of the Boolean query given as a ws-set defining
the worlds on which $\phi$ is true.

Example 5.1. Consider again the data cleaning example
from the Introduction, formalized by the U-relational
database of Figure 2. Relation W represents the set of possible
worlds and U represents the tuples in these worlds.

As discussed in Example 2.3, the set of ws-descriptors
$S = \{\{j \mapsto 1\}, \{j \mapsto 7, b \mapsto 4\}\}$ represents the three worlds
on which the functional dependency SSN → NAME holds.



cond: conditioning algorithm

In:  ws-tree R representing the new nonempty world-set,
     ws-set U from the U-relations
Out: (confidence value, ws-set U')

if $R = \emptyset$ then return $(1, U)$;

if $R = \bigotimes_{i \in I} (R_i)$ then
    foreach $i \in I$ do $(c_i, U_i) := \mathrm{cond}(R_i, U)$;
    return $(1 - \prod_{i \in I} (1 - c_i), \bigcup_{i \in I} U_i)$;

if $R = \bigoplus_{i \in dom_x} (x \mapsto i : R_i)$ then
    foreach $i \in dom_x$ do
        $U_i$ := the subset of U consistent with $x \mapsto i$;
        $(c_i, U_i) := \mathrm{cond}(R_i, U_i)$;
    $c := \sum_{i \in dom_x} P(\{x \mapsto i\}) \cdot c_i$;
    let $x'$ be a new variable;
    foreach $i \in dom_x$ such that $c_i \neq 0$ do
        add $(x', i, P(\{x \mapsto i\}) \cdot c_i / c)$ to the W relation;
        $U'_i$ := replace each occurrence of x in $U_i$ by $x'$;
    return $(c, \bigcup_{i \in dom_x} U'_i)$;

Figure 8: The conditioning algorithm.



The world $\{j \mapsto 7, b \mapsto 7\}$ is excluded and thus the confidence
of S does not add up to one but to .2 + .8 · .3 = .44.
What we now want to do is transform this database into
one that represents the three worlds identified by S and
preserves their tuples as well as their relative weights, but
with a sum of world weights of one. This can of course be
easily achieved by multiplying the weight of each of the three
remaining worlds by 1/.44. However, we want to do this in a
smart way that in general does not require considering each
possible world individually, but instead preserves a succinct
representation of the data and runs efficiently.

Such a technique exists and is presented in this section. It
is based on running our confidence computation algorithm
for ws-trees and, while returning from the recursion,
renormalizing the world-set by introducing new variables whose
assignments are normalized using the confidence values
obtained. For this example, the conditioned database will be

Note that the W relation actually models four possible worlds,
but two of them, $\{j' \mapsto 7, b \mapsto 4\}$ and $\{j' \mapsto 7, b \mapsto 7\}$, are
equal (contain the same tuples). Example 5.2 will show in
detail how conditioning works. □

Figure 8 gives our efficient algorithm for conditioning a
U-relational database. The input is a U-relational database
and a ws-tree R that describes the subset of the possible
worlds of the database that we want to condition it to. The
output is a modified U-relational database and, as a
by-product, since we recursively need to compute confidences
for the renormalization, the confidence of R in the input
database. The confidence of R in the output database will
of course be 1. The renormalization works as follows. The
probability of each branch of an inner node n of R is
re-weighted such that the probability of n becomes 1. We
reflect this re-weighting by introducing new variables whose
assignments reflect the new weights of the branches of n.

Figure 9: U-relation U, additions ΔW to the W-relation, and a renormalized ws-tree.

This algorithm is essentially the confidence computation
algorithm of Figure 7. We just add some lines of code along
the line of recursively computing confidence that renormalize
the weights of alternative assignments of variables for
which some assignments become impossible. Additionally,
we pass around a set of ws-descriptors (associated with tuples
from the input U-relational database) and extend each
ws-descriptor in that set by $x \mapsto i$ whenever we eliminate
variable x, for each of its alternatives i.
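The recursion described above can be sketched in Python as follows. This is an illustration that mirrors the pseudocode of Figure 8, not the authors' implementation: the tuple encoding of ws-trees, the dictionary-based world table `W`, the counter used to name new variables, and the extra case for empty branches are all assumptions. The usage example replays Example 5.1 with hypothetical tuple descriptors in U.

```python
import itertools

def cond(tree, U, W, fresh):
    """Return (confidence of tree, rewritten ws-set); new variables are added to W."""
    kind = tree[0]
    if kind == "leaf":                    # R is the nullary ws-descriptor
        return 1.0, [dict(d) for d in U]
    if kind == "bottom":                  # empty branch (not a case of Figure 8)
        return 0.0, []
    if kind == "ind":                     # independent partitioning
        miss, out = 1.0, []
        for child in tree[1]:
            c_i, U_i = cond(child, U, W, fresh)
            miss *= 1.0 - c_i
            out.extend(U_i)
        return 1.0 - miss, out
    _, x, branches = tree                 # variable elimination on x
    results = {}
    for i, child in branches.items():
        # Subset of U consistent with x -> i, each descriptor extended by x -> i.
        U_i = [{**d, x: i} for d in U if d.get(x, i) == i]
        results[i] = cond(child, U_i, W, fresh)
    c = sum(W[x][i] * c_i for i, (c_i, _) in results.items())
    x_new = f"{x}_{next(fresh)}"          # hypothetical naming scheme for x'
    W[x_new] = {}
    out = []
    for i, (c_i, U_i) in results.items():
        if c_i == 0:
            continue
        W[x_new][i] = W[x][i] * c_i / c   # renormalized weight of x' -> i
        for d in U_i:
            d = dict(d)
            d[x_new] = d.pop(x)           # replace x by x'
            out.append(d)
    return c, out

# Example 5.1: S = {{j -> 1}, {j -> 7, b -> 4}} written as a ws-tree.
W = {"j": {1: 0.2, 7: 0.8}, "b": {4: 0.3, 7: 0.7}}
R = ("mutex", "j", {1: ("leaf",), 7: ("mutex", "b", {4: ("leaf",)})})
U = [{"j": 1}, {"j": 7}, {"b": 4}, {"b": 7}]  # hypothetical tuple descriptors
c, U_new = cond(R, U, W, itertools.count(1))
```

In this run the returned confidence is .44 up to rounding, and the new variable created for j carries the weights .2/.44 and .24/.44, which sum to one.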

Example 5.2. Consider the U-relational database consisting
of the W-relation of Figure 3 and the U-relation U of
Figure 9. Let us run the algorithm to condition the database
on the ws-tree R of Figure 3 (R need not be precomputed
for conditioning).

We recursively call function cond at each node in the
ws-tree R starting at the root. To simplify the explanation, let
us assume a numbering of the nodes and of the ws-sets we
pass around: If $R_w$ is a (sub)tree then $R_{w,i}$ is its i-th child.
The ws-set passed in the recursion with $R_w$ is $U_w$ and the
ws-set returned is $U'_w$. The ws-sets passed on at the nodes
of R are:

\begin{align*}
U_1 &= U_2 = U\\
U_{1,1} &= x \mapsto 1 : U = \{\{x \mapsto 1, y \mapsto 2, u \mapsto 1\}, \{x \mapsto 1, u \mapsto 1, v \mapsto 2\}\}\\
U_{1,2} &= x \mapsto 2 : U = \{\{x \mapsto 2, y \mapsto 2, u \mapsto 1\}, \{x \mapsto 2, u \mapsto 1, v \mapsto 2\}\}\\
U_{1,2,1,1} &= y \mapsto 1 : U_{1,2} = \{\{y \mapsto 1, x \mapsto 2, u \mapsto 1, v \mapsto 2\}\}\\
U_{1,2,2,1} &= z \mapsto 1 : U_{1,2} = \{\{z \mapsto 1, x \mapsto 2, y \mapsto 2, u \mapsto 1\}, \{z \mapsto 1, x \mapsto 2, u \mapsto 1, v \mapsto 2\}\}\\
U_{2,1} &= u \mapsto 1 : U_2 = U_1\\
U_{2,2} &= u \mapsto 2 : U_2 = \emptyset\\
U_{2,1,1} &= v \mapsto 1 : U_{2,1} = \{\{v \mapsto 1, y \mapsto 2, u \mapsto 1\}\}
\end{align*}



When we reach the leaves of R, we start returning from
recursion and do the following. We first compute the
probabilities of the nodes of R - in this case, they are already
computed in Example 4.7. Next, for each $\oplus$-node representing
the elimination of a variable, say a, we create a new
variable a' with the assignments of a present at that node.
In contrast to a, the assignments of a' are re-weighted by the
probability of that $\oplus$-node so that the sum of their weights
is 1. The new variables and their weighted assignments are
given in Figure 9 along the original ws-tree R and in the
ΔW relation to be added to the world table W.

When we return from recursion, we also compute the new
ws-sets $U'_i$ from $U_i$. These ws-sets are equal in case of leaves
and $\otimes$-nodes, but, in case of $\oplus$-nodes, the variable
eliminated at that node is replaced by the new one we created.
In case of $\otimes$ and $\oplus$ nodes, we also return the union of all $U'_i$
of their children. We finally return from the first call with
the following ws-set U':
□

Let us view a probabilistic database semantically, as a set 
of pairs (I,p) of instances I with probability weights p. 

Theorem 5.3. Given a representation of probabilistic
database $W = \{(I_1, p_1), \dots, (I_n, p_n)\}$ and a ws-tree R identifying
a nonempty subset of the worlds of W, the algorithm of
Figure 8 computes a representation of probabilistic database

$$\{(I_j, p_j / c) \mid (I_j, p_j) \in W,\ I_j \in \omega(R)\}$$

such that the probabilities $p_j / c$ add up to 1.

Thus, of course, c is the confidence of R.

Three simple optimizations of this algorithm that simplify 
the world table W and the output ws-descriptors are worth 
mentioning. 

1. Variables that do not appear anywhere in the U-relations 
can be dropped from W . 



2. Variables with a single domain value (obviously of weight 
1) can be dropped everywhere from the database. 

3. Variables x' and x'' obtained from the same variable
x (by creation of a new variable in the case of variable
elimination on x in two distinct branches of the
recursion) can be merged into the same variable if the
alternatives and their weights in the W relation are the
same. In that case we can replace x'' by x' everywhere
in the database.

Example 5.4. In the previous example, we can remove
the variables y', z', and v' from the W-relation and all variable
assignments involving these variables from the U-relation
because of (2). Furthermore, we can remove the variables x
and z because of (1). The resulting database is

Finally, we state an important property of conditioning
(expressed by the assert operation) useful for query
optimization.

Theorem 5.5. Assert-operations commute with other as- 
serts and the operations of positive relational algebra. 

6. WS-DESCRIPTOR ELIMINATION 

We next present an alternative to exact probability computation
using ws-trees based on the difference operation
on ws-sets, called here ws-descriptor elimination. The idea
is to incrementally eliminate ws-descriptors from the input
ws-set. Given a ws-set $S = \{d_1, \dots, d_n\}$ and a ws-descriptor $d_1$ in S, we
compute two ws-sets: the original ws-set S without $d_1$, and
the ws-set representing the difference of $\{d_1\}$ and the first
ws-set. The probability of S is then the sum of the probabilities
of the two computed ws-sets, because the two ws-sets
are mutex, as stated below by function $P_w$:

\begin{align*}
P_w(\emptyset) &= 0, \qquad P_w(\{\emptyset\}) = 1,\\
P_w(S) &= P_w(\{d_2, \dots, d_n\}) + \sum_{d \in (\{d_1\} - \{d_2, \dots, d_n\})} P(d).
\end{align*}

The function P computes here the probability of a ws-descriptor.
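A minimal Python sketch of $P_w$ follows. The difference operation on ws-descriptors is defined earlier in the paper and not repeated here; `diff_one` below is an assumed construction that enumerates mutually exclusive descriptors covering the worlds of one descriptor that are not in another. The dictionary-based world table `W` and all function names are likewise assumptions made for illustration.

```python
from math import prod

def consistent(d1, d2):
    """True if d1 and d2 agree on every variable they share."""
    return all(d2.get(x, v) == v for x, v in d1.items())

def diff_one(d1, d2, W):
    """Mutex ws-descriptors covering the worlds of d1 that are not worlds of d2."""
    if not consistent(d1, d2):
        return [d1]                      # d1 and d2 are mutex already
    out, prefix = [], dict(d1)
    for x, a in d2.items():
        if x in d1:
            continue
        # Worlds that agree with d2 on earlier variables but differ on x.
        out += [{**prefix, x: b} for b in W[x] if b != a]
        prefix[x] = a
    return out

def diff(d, S, W):
    """{d} - S, subtracting the descriptors of S one at a time."""
    current = [d]
    for d2 in S:
        current = [e for c in current for e in diff_one(c, d2, W)]
    return current

def prob_descriptor(d, W):
    return prod(W[x][v] for x, v in d.items())

def P_w(S, W):
    """Probability of a ws-set by ws-descriptor elimination."""
    if not S:
        return 0.0
    d1, rest = S[0], S[1:]
    return P_w(rest, W) + sum(prob_descriptor(e, W) for e in diff(d1, rest, W))

# Hypothetical world table and ws-set.
W = {"x": {1: 0.1, 2: 0.4, 3: 0.5}, "y": {1: 0.2, 2: 0.8}, "z": {1: 0.4, 2: 0.6}}
S = [{"x": 1}, {"y": 1}, {"x": 2, "z": 1}]
print(P_w(S, W))  # close to 0.408, up to floating-point rounding
```

Since the descriptors produced by `diff` are mutex, their probabilities can simply be summed; a generator could be used instead of a list so that each descriptor is discarded right after it is added to the sum.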

Example 6.1. Consider the ws-set $\{d_1, d_2, d_3\}$ of Example 3.1.
The ws-descriptor $d_2$ is mutex with both $d_1$ and $d_3$
and we can eliminate it: $P_w(\{d_1, d_2, d_3\}) = P_w(\{d_1, d_3\}) + P(d_2)$.
We now choose to eliminate $d_3$ and obtain
$P_w(\{d_1, d_3\}) = P_w(\{d_3\} - \{d_1\}) + P(d_1) = P(d_1)$, as
explained in Example 3.3. Thus $P_w(\{d_1, d_2, d_3\}) = P(d_2) + P(d_1) = 1$. □

This method exploits the fact that the difference operation 
preserves the mutex property and is world-set monotone. 

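Lemma 6.2. The following equations hold for any ws-sets
$S_1$, $S_2$, and $S_3$:

\begin{align*}
\omega(S_1 - S_2) &\subseteq \omega(S_1)\\
\omega(S_1) \cup \omega(S_2) &= \omega(S_1 - S_2) \cup \omega(S_2)\\
\emptyset &= \omega(S_1 - S_2) \cap \omega(S_2)\\
\omega(S_1) \cap \omega(S_2) = \emptyset &\Rightarrow \omega(S_1 - S_3) \cap \omega(S_2 - S_3) = \emptyset
\end{align*}

The correctness of probability computation by ws-descriptor
elimination follows immediately from Lemma 6.2.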

Theorem 6.3. Given a ws-set S, the function P w com- 
putes the probability of S. 

As a corollary, we have that

Corollary 6.4 (Theorem 6.3). Any ws-set $\bigcup_{i=1}^{n} \{d_i\}$ has
the equivalent mutex ws-set $\bigcup_{i=1}^{n-1} (\{d_i\} - \bigcup_{j=i+1}^{n} \{d_j\}) \cup \{d_n\}$.

Like the translation of ws-sets into ws-trees, this method
can take exponential time in the size of the input ws-set.
Moreover, the equivalent mutex ws-set given above can be
exponential. On the positive side, computing the exact
probability of such mutex ws-sets can be done in linear time.
Additionally, the probability of $\{d\} - S_d$ can be computed on
the fly without first generating all ws-descriptors
in the difference ws-set. This follows from the fact that
the difference operation on ws-descriptors only generates
mutex and distinct ws-descriptors. After generating a
ws-descriptor from the difference ws-set we can thus add its
probability to a running sum and discard it before generating
the next ws-descriptor. The next section reports on
experiments with an implementation of this method.

7. EXPERIMENTS 

The experiments were conducted on an Athlon-X2 (4600+) 
x86-64bit/1.8GB/ Linux 2.6.20/gcc 4.1.2 machine. 

We considered two synthetic data sets. 

TPC-H data and queries. The first data set consists of
tuple-independent probabilistic databases obtained from
relational databases produced by TPC-H 2.7.0, where each
tuple is associated with a Boolean random variable and the
probability distribution is chosen at random. We evaluated
the two Boolean queries of Figure 10 on each probabilistic
database and then computed the probability of the ws-set
consisting of the ws-descriptors of all the answer tuples.
Among the two queries, only the second is safe and thus
admits PTIME evaluation on tuple-independent probabilistic
databases [9]. As we rewrite constraints into Boolean
queries, we consider this querying scenario equally relevant
to conditioning.

Q1: select true from customer c, orders o, lineitem l

where c.mktsegment = 'BUILDING' and c.custkey = o.custkey

Figure 10: TPC-H scenario: Queries, data characteristics, and performance of INDVE(minlog).

Figure 11: The two cases when the numbers of variables and of ws-descriptors differ by orders of magnitude.



#P-hard cases. The second data set consists of ws-sets
similar to those associated with the answers of nonhierarchical
conjunctive queries without self-joins on tuple-independent
probabilistic databases, i.e., join queries such as
$Q_s = R_1 \bowtie \cdots \bowtie R_s$ for schemas $R_i(A_i, A_{i+1})$ in which all relations are
joined together, but there is no single column common to
all of them. Such queries are known to be #P-hard [9].

The data generation is simple: we partition the set of
variables into s equally-sized sets $V_1, \dots, V_s$ and then sample
ws-descriptors $\{x_1 \mapsto a_1, \dots, x_s \mapsto a_s\}$ where $x_i$ is from $V_i$ and $a_i$
is a random alternative for $x_i$, for $1 \le i \le s$. It is easy to
verify that each ws-set sampled in this way is actually the result of query
$Q_s$ on some tuple-independent probabilistic database. (For
s = 3 this fact is used in the #P-hardness proof of [9].)
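The generation procedure can be sketched in Python as follows. The function name, the variable naming scheme, the seed parameter, and the uniform world table are assumptions for illustration; in particular, the sketch does not remove duplicate ws-descriptors, which the text does not specify.

```python
import random

def generate_ws_set(n, r, s, w, seed=0):
    """Sample w ws-descriptors of length s over n variables with r alternatives each."""
    rng = random.Random(seed)
    size = n // s  # variables per group; leftover variables are ignored in this sketch
    groups = [[f"x{g * size + k}" for k in range(size)] for g in range(s)]
    # Uniform probabilities 1/r for every alternative of every variable.
    W = {x: {a: 1.0 / r for a in range(1, r + 1)} for group in groups for x in group}
    S = []
    for _ in range(w):
        chosen = [rng.choice(group) for group in groups]   # one variable per group V_i
        S.append({x: rng.randint(1, r) for x in chosen})   # random alternative for each
    return groups, W, S

groups, W, S = generate_ws_set(n=100, r=4, s=4, w=20)
```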

We use the following parameters in our experiments: num- 
ber n of variables ranging from 50 to 100K, number r of 
possible alternatives per variable (2 or 4), length s of ws- 
descriptors, which equals the number of joined relations (2 
or 4), and number w of ws-descriptors ranging from 5 to 60K. 
For each variable, the alternatives have uniform probabili- 
ties 1/r: our exact algorithms are not sensitive to changing 
probability values as long as the numbers of alternatives of 
the variables remain constant. 

Note that the focus on Boolean queries means no loss of
generality for confidence computation; rather, the projection
of a query result to a nullary relation causes all the ws-sets
to be unioned and large.

Algorithms. We experimentally compared three versions
of our exact algorithm: one that employs independent partitioning
and variable elimination (INDVE), one that employs
variable elimination only (VE), and one with ws-descriptor
elimination (WE). We considered INDVE with the two heuristics
minlog and minmax. These implementations compute
confidence values and the modified world table (ΔW in
Example 5.2), but do not materialize the modified, conditioned
U-relations (U' in Example 5.2). We have verified that the
computation of these additional data structures adds only
a small overhead over confidence computation in practice.
We therefore do not distinguish in the sequel between confidence
computation and conditioning. Note that our implementation
is based on the straightforward composition of
the ComputeTree and conditioning algorithms and does not
need to materialize the ws-trees.

Although we also implemented a brute-force algorithm for
probability computation, its running time is extremely poor and not
reported. In short, this algorithm iterates over all worlds
and sums up the probabilities of those that are represented
by some ws-descriptors in the input ws-set. We also tried
a slight improvement of the brute-force algorithm by first
partitioning the input ws-set into independent subsets [22].
This version, too, performed badly and is not reported, as the
partitioning can only be applied once at the beginning on
the whole ws-set, yet most of our input ws-sets only exhibit
independence in the context of variable eliminations.

We experimentally compared INDVE against a Monte
Carlo simulation algorithm for confidence computation [21]
[9] which is based on the Karp-Luby (KL) fully polynomial
randomized approximation scheme (FPRAS) for DNF
counting [17]. Given a DNF formula with m clauses, the
base algorithm computes an $(\epsilon, \delta)$-approximation $\hat{c}$ of the
number of solutions $c$ of the DNF formula such that

$$\Pr[\,|c - \hat{c}| \le \epsilon \cdot c\,] \ge 1 - \delta$$

for any given $0 < \epsilon < 1$, $0 < \delta < 1$. It does so within
$\lceil 4 \cdot m \cdot \log(2/\delta) / \epsilon^2 \rceil$ iterations of an efficiently computable
estimator. This algorithm can be easily turned into an
$(\epsilon, \delta)$-FPRAS for tuple confidence computation (see [H]). In our
experiments, we use the optimal Monte-Carlo estimation
algorithm of [8]. This is a technique to determine a small
sufficient number of Monte-Carlo iterations (within a constant
factor from optimal) based on first collecting statistics on the
input by running the Monte Carlo simulation a small number
of times. We use the version of the Karp-Luby unbiased
estimator described in the book [24], which converges faster
than the basic algorithm of [17], adapted to the problem of
computing confidence values. This algorithm is similar to
the self-adjusting coverage algorithm of [18].

[Plot panel titles: "Number of variables close to ws-set size, 70 variables, r=4, s=4" and "INDVE heuristics; 100K variables, r=4(2), s=4". Series of the first panel: indve(ymax), indve(median), kl(e.001), indve(ymin). Horizontal axis: size of ws-set (ln scale).]

Figure 12: Performance of INDVE and KL when
numbers of variables and ws-descriptors are close.
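For orientation, the basic Karp-Luby coverage estimator with the fixed iteration bound stated above can be sketched in Python as follows. This is the textbook scheme adapted to ws-sets, not the faster-converging variant or the optimal stopping rule used in the experiments. The function name, the data representation, the seed parameter, and the use of the natural logarithm in the iteration bound are assumptions.

```python
import math
import random

def karp_luby(S, W, eps, delta, seed=0):
    """Estimate the probability of ws-set S over world table W."""
    rng = random.Random(seed)
    weights = [math.prod(W[x][v] for x, v in d.items()) for d in S]
    Z = sum(weights)                       # sum of the descriptor probabilities
    variables = sorted({x for d in S for x in d})
    N = math.ceil(4 * len(S) * math.log(2 / delta) / eps ** 2)
    hits = 0
    for _ in range(N):
        # Pick descriptor i with probability P(d_i) / Z, then a world satisfying it.
        i = rng.choices(range(len(S)), weights=weights)[0]
        world = dict(S[i])
        for x in variables:
            if x not in world:
                vals = list(W[x])
                world[x] = rng.choices(vals, weights=[W[x][v] for v in vals])[0]
        # Count the sample only if d_i is the first descriptor the world satisfies,
        # so that every world of the union is counted exactly once.
        first = next(j for j, d in enumerate(S)
                     if all(world[x] == v for x, v in d.items()))
        hits += (first == i)
    return Z * hits / N

# Hypothetical world table and ws-set; the exact probability is 0.408.
W = {"x": {1: 0.1, 2: 0.4, 3: 0.5}, "y": {1: 0.2, 2: 0.8}, "z": {1: 0.4, 2: 0.6}}
S = [{"x": 1}, {"y": 1}, {"x": 2, "z": 1}]
estimate = karp_luby(S, W, eps=0.1, delta=0.01)
```

The number of iterations grows with $1/\epsilon^2$, which is consistent with the slow convergence for small $\epsilon$ reported in the experiments below.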

1. Queries on TPC-H data. Figure 10 shows that
INDVE(minlog) performs within hundreds of seconds in case of
queries with equi-joins (Q1) and selection-projection (Q2) on
tuple-independent probabilistic TPC-H databases with over
700K variables and 60K ws-descriptors. In the answers of
query Q2, ws-descriptors are pairwise independent, and
INDVE can effectively employ independence partitioning,
making confidence computation more efficient than for Q1.

The remaining experiments use the second data generator. 

2. The numbers of variables and of ws-descriptors
differ by orders of magnitude. If there are many more
ws-descriptors than variables, many ws-descriptors share variables
(or variable assignments) and a good choice for variable
elimination can effectively partition the ws-set. On
the other hand, independence partitioning is unlikely to be
very effective, and the time for checking it is wasted.
Figure 11(a) shows that in such cases VE and INDVE (with
minlog heuristic) are very stable and not influenced by
fluctuations in data correlations. In particular, VE performs
better than INDVE and within a second for 100 variables
with domain size 4 (and nearly the same for 2), ws-descriptors
of length 4, and ws-set size above 1.2k. We witnessed a sharp
hard-easy transition at 1.2k, which suggests that the
computation becomes harder when the number of ws-descriptors
falls under one order of magnitude greater than the number
of variables. Experiment 3 studies easy-hard-easy transitions
in more detail. The plot data were produced from 25
runs and record the median value and ymin/ymax for the
error bars.

In case of many variables and few ws-descriptors, the
independence partitioning clearly pays off. This case naturally
occurs for query evaluation on probabilistic databases,
where a small set of tuples (and thus of ws-descriptors) is
selected from a large database. As shown in Figure 11(b),
INDVE(minlog) performs within seconds for the case of 100K
variables and 100 to 6K ws-descriptors of size s = 2, and
with variable domain size r = 4. Two further findings are
not shown in the figure: (1) VE performs much worse than
INDVE, as it cannot exploit the independence of tuples and
thus creates partitions that overlap at large; (2) the case of
s = 4 has a few (2 in 25) outliers exceeding 600 seconds.

3. The numbers of variables and of ws-descriptors
are close. It is known from literature on knowledge compilation
and model counting [6] that the computation becomes
harder in this case. Figure 12 shows the easy-hard-easy
pattern of INDVE(minlog) by plotting the minimal, maximal,
and median computation time of 20 runs (max allowed
time of 9000s). We experimentally observed the expected
sharp transitions: When the numbers of ws-descriptors and
of variables become close, the computation becomes hard
and remains so until the number of ws-descriptors becomes
one order of magnitude larger than the number of variables.
The behavior of WE (not shown in the figure) follows very
closely the easy-hard transition of INDVE, but in our experiment
WE does not return to the easy case within
the range of ws-set sizes reported on in the figure.

Figure 13: Heuristics: minmax versus minlog.

4. Exact versus approximate computation. We
experimentally verified our conjecture that the Karp-Luby
approximation algorithm (KL) converges rather slowly. In case
the numbers of variables and of ws-descriptors differ by orders
of magnitude, INDVE(minlog) and VE(minlog) are definitely
competitive when compared to KL with parameters
$\epsilon = 0.1$ resp. $\epsilon = 0.01$, and $\delta = 0.01$, see Figure 11.

In Figure 11(b), KL uses about the same number of iterations
for all the ws-set sizes, a sufficient number to warrant
the running time. The reason for the near-constant line for
KL is that for s = 2 and 100k variables, ws-descriptors are
predominantly pairwise independent, and the confidence is
close to $1 - (3/4)^w$, where w is the number of ws-descriptors.
But this quickly gets close to 1, and the optimal algorithm
can decide on a small number of iterations that does not
increase with w. In case the numbers of variables and
ws-descriptors are close (Figure 12), KL with $\epsilon = 0.001$ only
performs better than INDVE(minlog) in the hard cases.

5. Heuristics for variable elimination. Figure 13 shows
that, although the minmax heuristic is cheaper to compute
than the minlog heuristic, using minlog we find in general
better choices of variables and INDVE remains less sensitive
to data correlations. The plot data are produced from 10
runs and show the median value and ymin/ymax for the
error bars. Although VE exceeds the allocated time of 600
seconds for different data points, it does this less than five
times (the median value is closer to ymin).

8. RELATED WORK 

To the best of our knowledge, this paper is the first to 
study the conditioning problem for probabilistic databases. 
In this section, we survey related work in the areas of prob- 
abilistic databases and knowledge compilation procedures. 

U-relations capture most other representation formalisms
for uncertain data that were recently proposed in the literature,
including those of MystiQ [9], Trio [5], and MayBMS [3].
For each of these formalisms, natural applications in data
cleaning and other areas have been described [5] [4] [9].

Graphical models are a class of rich formalisms for rep- 
resenting probabilistic information which perform well in 
scenarios in which conditional probabilities and a known 
graph of dependencies and independences between events 
are available. There are, for instance, Bayesian network 
learning algorithms that produce just such data. Unfortu- 
nately, if probabilistic data is obtained by queries on tuple- 
independent or similar databases, the corresponding graph- 
ical models tend to be relatively flat [23] but have high 
tree-width, which causes techniques widely used for confi- 
dence computation on graphical models to be highly ineffi- 
cient. Graphical models are more succinct than U-relations, 
yet their succinctness does not benefit the currently known 
query evaluation techniques. This justifies the development 
of conditioning techniques specifically for the c-table-like 
representations (such as U-relations) developed by the data- 
base community. 

It has long been known that computing tuple confidence
values on DNF-like representations of sets of possible worlds
is a generalization of the DNF model counting problem and
is #P-complete [10]. Monte Carlo approximation techniques
for confidence computation have been known since the original
work by Karp, Luby, and Madras [18]. Within the
database field, this approach has first been followed in work
on query reliability [14] and in the MystiQ project [9].
Section 7 reports on an experimental comparison of approximation
and our exact algorithms.

Our variable elimination technique is based on the Davis-Putnam
procedure for satisfiability checking [12]. This procedure
was already used for model counting [6]. Our approach
combines it with independent partitioning for efficiently
solving two more difficult problems: exact confidence
computation and conditioning. [6] uses the minmax
heuristic (which we benchmark against) and discusses
experiments for CNF formulas with up to 50 variables and
200 clauses only. Our experiments also discuss new settings
that are more natural in a database context: for instance,
when the size of a query answer (and thus the number of
ws-descriptors) is small in comparison to the size of the input
database (and thus of variables). Follow-up work [15]
reports on techniques for compiling ws-sets generated by
conjunctive queries with inequalities into decision diagrams
with polynomial-time guarantees.

Finally, there is a strong connection between ws-trees and
ordered binary decision diagrams (OBDDs). Both make the
structure of the propositional formulas explicit and allow for
efficient manipulation. They differ, however, in important
aspects: binary versus multistate variables, same variable
ordering on all paths in case of OBDDs, and the new ws-tree
$\otimes$-node type, which makes independence explicit. It is
possible to reduce the gap between the two formalisms, but
this affects the representation size. For instance, different
variable orderings on different paths allow for exponentially
more succinct BDDs [20]. Multistate variables can be easily
translated into binary variables at a price of a logarithmic
increase in the number of variables [25].

APPENDIX

Proof of Theorem 4.4

We prove that the translation from ws-sets to ws-trees is
correct. That is, given a ws-set S, ComputeTree(S) and S
represent the same world-set.

We use induction on the structure of ws-trees. In the
base case, we map ws-sets representing the empty world-set
to $\bot$, and ws-sets containing the universal ws-descriptor
(that represents the whole world-set) to $\emptyset$. We consider now
a ws-set S. We have two cases corresponding to the different
types of ws-tree inner nodes.

Case 1. Assume $S = \bigcup_{i \in I} S_i$ with $S_i$ pairwise independent
and $R_i = \mathrm{ComputeTree}(S_i)$. By hypothesis, $\omega(R_i) = \omega(S_i)$.
Then, $\mathrm{ComputeTree}(S) = \bigotimes_{i \in I} (R_i)$ and

$$\omega(\mathrm{ComputeTree}(S)) = \bigcup_{i \in I} \omega(R_i) = \bigcup_{i \in I} \omega(S_i) = \omega(S).$$

Case 2. Let x be a variable in S and consider the ws-sets
$S_{x \mapsto i}$ ($i \in dom_x$) and T as given by ComputeTree.
Because the whole world-set can be represented by
$A = \bigcup_{i \in dom_x} \{\{x \mapsto i\}\}$, it holds that $\omega(A) \cap \omega(S) = \omega(S)$. We
push the assignments of x in each ws-descriptor of S and
obtain

$$\omega(S) = \omega\Big(\bigcup_{i \in dom_x} \{d \cup \{x \mapsto i\} \mid d \in S\}\Big).$$

We can remove all inconsistent ws-descriptors in the ws-set
of the right-hand side while preserving equivalence:

\begin{align*}
\omega(\{d \cup \{x \mapsto i\} \mid d \in S\}) &= \omega(\{d \cup \{x \mapsto i\} \mid d \in S, \nexists j \in dom_x : j \neq i, \{x \mapsto j\} \subseteq d\})\\
&= \omega(\{d \cup \{x \mapsto i\} \mid \{x \mapsto i\} \subseteq d \in S\}) \cup \omega(\{d \cup \{x \mapsto i\} \mid d \in S, \nexists j \in dom_x : \{x \mapsto j\} \subseteq d\})\\
&= \omega(S_{x \mapsto i}) \cup \omega(T) = \omega(S_{x \mapsto i} \cup T)
\end{align*}

We now consider all values $i \in dom_x$ and obtain

\begin{align*}
\omega(S) &= \omega\Big(\bigcup_{i \in dom_x} (S_{x \mapsto i} \cup T)\Big)\\
&= \omega\Big(\bigoplus_{i \in dom_x} x \mapsto i : (S_{x \mapsto i} \cup T)\Big).
\end{align*}

Proof of Theorem 5.3

We prove that given a representation of probabilistic database
$W = \{(I_1, p_1), \dots, (I_n, p_n)\}$ and a ws-tree R identifying
a nonempty subset of the worlds of W, the algorithm of
Figure 8 computes a representation of probabilistic database

$$\{(I_j, p_j / c) \mid (I_j, p_j) \in W,\ I_j \in \omega(R)\}$$

such that the probabilities $p_j / c$ add up to 1.

The conditioning algorithm computes the probability c of
each node of the input ws-tree R as given by our probability
computation algorithm of Figure 7. We next consider
the correctness of renormalization using induction on the
structure of the input ws-tree.

Base case: The ws-tree represents the whole world-set
and we thus return U unchanged (no conditioning is done).

Induction cases $\otimes$ (independent partitioning) and $\oplus$ (variable
elimination). For both node types, we return the union
of ws-sets $U'_i$ that are the ws-sets $U_i \subseteq U$ where the variables
encountered at the nodes on the recursion path are replaced
by new ones. The ws-sets $U_i$ are the subsets of U consistent
with each child of the $\oplus$ or $\otimes$ node. By hypothesis, the
ws-sets $U_i$ are conditioned correctly. In case of $\otimes$-nodes, no
further conditioning is done, because no re-weighting takes
place. In case of a $\oplus$-node, we re-weight the assignments of
the variable eliminated at that node.

Let $I \subseteq dom_x$ be the set of alternatives of x present at
that node. Since

$$P(R) = P\Big(\bigoplus_{i \in I} (x \mapsto i : R_i)\Big) = \sum_{i \in I} P(\{x \mapsto i\}) \cdot P(R_i),$$

we create a new variable $x'$ with

$$P(\{x' \mapsto i\}) := \frac{P(\{x \mapsto i\}) \cdot P(R_i)}{P(R)}.$$

This guarantees that

$$P\Big(\bigoplus_{i \in I} (x' \mapsto i : R_i)\Big) = 1.$$

If we ask which tuples of U should be in an instance
satisfying R, the answer is of course all those whose
ws-descriptors are consistent with one of the ws-descriptors in
$x \mapsto i : R_i$ for some $i \in I$. The U-relation tuples in the
results of the invocations $\mathrm{cond}(R_i, U_i)$ ensure exactly this.



